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University of Arizona and Yale University 

DNA Copy number variation (CNV) has recently gained consid- 
erable interest as a source of genetic variation that likely influences 
phenotypic differences. Many statistical and computational methods 
have been proposed and applied to detect CNVs based on data that 
generated by genome analysis platforms. However, most algorithms 
are computationally intensive with complexity at least 0{n^), where 
n is the number of probes in the experiments. Moreover, the theo- 
retical properties of those existing methods are not well understood. 
A faster and better characterized algorithm is desirable for the ultra 
high throughput data. In this study, we propose the Screening and 
Ranking algorithm (SaRa) which can detect CNVs fast and accu- 
rately with complexity down to 0(n). In addition, we characterize 
theoretical properties and present numerical analysis for our algo- 
rithm. 

1. Introduction. The problem of change-points detection has been stud- 
ied by researchers in various fields including statisticians, engineers, econo- 
mists, climatologists and biostatisticians since 1950s. We refer to Bhat- 
tacharya (1994) for an overview of this subject. In this paper, we focus 
on a high dimensional sparse normal mean model 

(1.1) Yi = fii + ei, l<i<n, 

where Y = (Yi, . . . ,1^)"^ is a sequence of random variables, its mean /x = 
{ill, . . . , UnY^ is a piecewise constant vector and the errors Si are i.i.d. ~ 
A^(0,cj^). A change-point in this model is a position r such that 7^ /Ur+i- 
Assume that there are J change-points < ri < • • • < rj < n. We are inter- 
ested in the case when n is large, and J is small. The goal is to estimate the 
number of change-points, J, and the location vector r = (ti, . . . ,tj)'^ . Note 
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that all positions are potential candidates for a change-point, which makes 
the problem high dimensional and hard to solve. However, the small J im- 
plies the sparse structure of the model, which we can utilize to develop fast 
and efficient algorithms. 

1.1. Background. The main motivation of this project is to develop effi- 
cient statistical techniques for DNA copy number variation (CNV) detection, 
for which model (1.1) plays an important role, for example, Olshen et al. 
(2004), Huang et al. (2005), Zhang and Siegmund (2007), Tibshirani and 
Wang (2008), among others. The DNA copy number of a region is the num- 
ber of copies of the genomic DNA. CNV usually refers to deletion or duplica- 
tion of a region of DNA sequences compared to a reference genome assembly. 
Recent studies in genetics have shown that CNVs account for a substantial 
amount of genetic variation and provide new insights in disease association 
studies [Freeman et al. (2006), McCarroll and Altshuler (2007)]. Identifica- 
tion of CNVs is essential in the systematic studies of understanding the role 
of CNV in human diseases. Over the last decade, efforts have been made to 
detect DNA copy number variation with the help of significant advances in 
DNA array technology, including array comparative genomic hybridization 
(aCGH), single nucleotide polymorphism (SNP) genotyping platforms, and 
next-generation sequencing. We refer to Zhang et al. (2009) for a nice review 
of these technologies. These various platforms and techniques have provided 
a large amount of data that are rich in structure, motivating new statistical 
methods for their analysis. 

1.2. Current segmentation methods. Historically, the case that there is at 
most one change-point in model (1.1) has been intensively studied. See Sen 
and Srivastava (1975), James, James and Siegmund (1987), among others. 
The problem of single change-point detection is equivalent to the following 
hypothesis testing problem: 

Hq: fii = ■ ■ ■ = fj,n, against, 

(1.2) 

i^i : /ii = • • • = / /ij+i = •••=//„ for some j. 

Let Sfc be the partial sum of observations Yli=i When Var(li) is known, 
a commonly used test statistic is the log likelihood ratio statistic 

(1.3) r„= max UjSJn-S,f/[j{l-j/n)]}. 

l<j<n— 1 

The log likelihood ratio test is usually satisfactory although the distribution 
of Tn is quite complicated. We refer to Csorgo and Horvath (1997) for the 
theoretical analysis of the log likelihood ratio test. 

When there are multiple change-points, the problem is much more com- 
plicated. Here we briefly review several popular approaches, including an ex- 
haustive search with the Schwarz criterion [Yao (1988), Yao and Au (1989)], 
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the circular binary segmentation method [Olshen et al. (2004), Venkatraman 
and Olshen (2007)] and the ii penalization method [Huang et al. (2005), 
Tibshirani et al. (2005), Tibshirani and Wang (2008)]. 

To estimate J and r, an exhaustive search among all possibilities < 
J < n — 1 and < fi < • • • < f j < n can be applied, theoretically. Let r = 
(fi, . . . , fj)-^. For each candidate (J,t), the MLE for the variance, ^"^j 

can be calculated by the least squares method. Yao (1988), Yao and Au 
(1989) applied the Schwarz criterion, or the Bayesian Information Criterion 
(BIC), to estimate the number and locations of change-points and showed 
the consistency of the estimate. In particular, the number of change-points 
is estimated as 

(1.4) J = argmin — log (5"^; -|- J log n where a^? = minir^: 

Once J is determined, the location estimator is defined as 

T = argmin a'^-, . , . 

Unfortunately, when n is large, these estimators are not attainable due to 
the computational complexity. Braun, Braun and Miiller (2000) employed 
a dynamic programming algorithm to reduce the computational burden to 
the order of O(n^). But it is still computationally expensive for large n. 

An alternative approach is through the binary segmentation method, 
which applies the single change-point test recursively to determine all the 
change-points. As pointed out in Olshen et al. (2004), one of the drawbacks 
is that the binary segmentation can hardly detect small segments buried in 
larger ones. To overcome this problem, Olshen et al. (2004) introduced a Cir- 
cular Binary Segmentation (CBS) method and applied it for CNV detection. 
Lai et al. (2005) compared 11 algorithms on the segmentation in array CGH 
data and reported that CBS is one of the best. However, they found that 
CBS was also one of the slowest algorithms. The computational complexity 
of these recursive algorithms are at least O(n^). Venkatraman and Olshen 
(2007) proposed a faster CBS algorithm by setting an early stopping rule. 
The improved algorithm works much faster in practice without loss of much 
accuracy. 

Another alternative to exhaustive searching is the ii penalization ap- 
proach. Huang et al. (2005) studied the change-points detection problem 
via the following optimization problem: 

(1.5) minimize ||Y — subject to — jij+i] < s. 

j 

Tibshirani and Wang (2008) applied the fused lasso [Tibshirani et al. (2005)] 
for change-points detection, which 

minimizes ||Y — fi\\'^ subject to < si, \fj,j — < S2- 

j j 
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They added one more constraint since they assume that fx itself is a sparse 
vector. 

In fact, the simpler one of the above two, that is, (1.5), is equivalent 
to the lasso after a reparametrization = /ij+i — ^j, j = 1, . . . ,re — 1. The 
complexity of the fastest algorithm to solve the lasso is at least 0{v?) [Fried- 
man et al. (2007)]. See also recent developments on the fused lasso [Rinaldo 
(2009), Hoefling (2010), Zhang et al. (2010), Tibshirani and Taylor (2011)]. 

We consider all aforementioned methods as global methods, in the sense 
that they use all data points repeatedly in the process of determining change- 
points. Consequently, the algorithms are usually computationally expensive. 
To the best of our knowledge, the algorithms for the change-point detection 
that possess consistency properties have the computational complexities in 
the order of O(n^) or higher, although there are other algorithms that are 
shown to be fast numerically but whose theoretical properties in consis- 
tency and computational complexities are not well understood. As pointed 
out in Braun, Braun and Miiller (2000), the construction of efficient 0{n) 
algorithms which scale up to large sequences remains to be an important 
research topic. 

Last but not least, it is worth mentioning that the PennCNV [Wang et al. 
(2007)] is a fast and efficient algorithm to detect CNVs, although it is based 
on a different model from (1.1). 

1.3. An approach via local information. To detect the change-points, we 
need to examine the sequence both globally and locally. Heuristically, there 
should exist a neighborhood around a change-point that contains sufficient 
information to detect the change-point. For instance, it is unlikely for the 
value of lio,ooo to contribute much information for detecting a change-point 
around the position 100. Therefore, global methods mentioned in the last 
subsection may not be efficient. By focusing on a local region, we can reduce 
the computational burden, especially when the data set is huge. Based on 
this philosophy, we propose the Screening and Ranking algorithm (SaRa), 
which is a powerful change-point detection tool with computational com- 
plexity down to 0{n). 

Roughly speaking, the idea is to find a locally defined diagnostic mea- 
sure D to reflect the probability of a position being a change-point. By cal- 
culating the measure D for all positions, which we call the screening step, 
and then ranking, we can quickly find a list of candidates that are most likely 
to be the change-points. In fact, such diagnostic measures have been pro- 
posed and used in related literature, for example, Yin (1988), Gijbels, Hall 
and Kneip (1999). For example, at a point of interest x, a simple diagnostic 
function is the difference of the average among observations on the left-hand 
side and the right-hand side, that is, D{x) = Yl!t=i ^j+i-t/h — X^^Li Y'j+t/h. 
The larger |-D(a;)| is, the more likely that x is a change-point. Note that, to 
calculate D{x), we need only the data points near the position x. Therefore, 
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the strategy through a local statistic can save a lot of time, compared to 
global methods. Moreover, we will show that this fast algorithm is also ac- 
curate by theoretical and numerical studies. In particular, the SaRa satisfies 
the sure coverage property, which is slightly stronger than model consis- 
tency. 

To summarize, the local methods are usually more efficient than global 
ones to solve the change-points problem for high-throughput sequencing 
data, and it is feasible to evaluate their computational and asymptotic prop- 
erties. With these merits, we believe the local methods have been under uti- 
lized and deserve more attention and development in solving change-points 
problems. 

This paper is organized as follows. In Section 2 we introduce the screening 
and ranking algorithm to solve the change-points problem (1.1). In Section 3 
we study the theoretical properties of the SaRa and show that it satisfies 
the sure coverage property. Numerical studies are illustrated in Section 4. 
Proofs of theorems and lemmas are presented in the supplementary material 
[Niu and Zhang (2012)]. 

2. The screening and ranking algorithm. 

2.1. Model (1.1) revisited. In model (1.1), assume the mean of each vari- 
able in the jth segment is /3j, that is, Hi = f3j, for all integers i £ (Tj,Tj_|_i]. 
For clarity, the errors are assumed to be i.i.d. A^(0,it^) as before, although 
the normality may be relaxed. Note that in some of the existing work, the 
step function is defined as a left-continuous function on unit interval [0, 1], 
and change-points < ti < • • • < i j < 1 are discontinuities of fj,{x). Two no- 
tations are consistent with convention /i(-) = Hi and tj = Tj/n. 

For < J < oo, we refer to the Gaussian model (1.1) with J change- 
points as Aij. After a reparameterization of 13 j = /3o + J2j=i that is, 5j = 
j3j — j = 1, . . . , J, model A4 j has parameters 

6 = {Po,6i,...,6j,Ti,...,Tj,a^) 

to be determined. Let r = (ri, . . . ,Tn) and S = (5i, . . . , 6n) be the change- 
points location vector and jump-size vector, respectively. The purpose is to 
estimate J as well as the model parameter vector 6. Since n is large and 
J '^n, the variance u^, or an upper bound of u^, can be easily estimated. 
Once J and r are well estimated, [Sq and S can be estimated by the least 
squares method. In short, the main task is to estimate the number of change- 
points J and the location vector r. 

Roughly speaking, to detect the change-points in model (1.1), the SaRa 
proceeds as follows. In the screening step, we calculate at each point x a local 
diagnostic function D{x) which depends only on the observations in a small 
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neighborhood [x — h,x + h]. In the ranking step, we rank the local maximum 
of the function and the top J points fi, . . . ,fj will be the estimated 

change-points, where J can be determined by a thresholding rule or Bayesian 
information criterion. 

In the following subsections, we describe the SaRa in detail. 

2.2. Local diagnostic functions. We begin with the local diagnostic func- 
tion which is crucial in the SaRa. Roughly speaking, an ideal local diagnostic 
statistic at a position, say, x, is a statistic whose value directly related to 
the possibility that x is a change-point. For model (1.1), we first introduce 
two simple examples of a local diagnostic statistic. 

An obvious choice is D{x) = J2k=i^^+^~k/h — X]fc=i ^x+fc/^, for a fixed 
integer h [Yin (1988)]. It is simply the difference between averages of h 
data points on the left side and right side of x. Heuristically, consider the 
noiseless case in which Y = /x is a piecewise constant vector. D{x) is simply 
a piecewise linear function, whose local optimizers correspond to the true 
change-points. Obviously, the quantity of D{x) reflects the probability that x 
is, or is near to, a change-point. 

An alternative choice is the "local linear estimator of the first derivative," 
studied in Gijbels, Hall and Kneip (1999). Intuitively, if we use a smooth 
curve to approximate the true step function /x, the first derivative or slope 
of the approximation curve should be large around a change-point and near 
zero elsewhere. The approximation curve as well as its first derivative can 
be calculated easily by local polynomial techniques. Therefore, a local linear 
estimator of the first derivative is a reasonable choice for the local diagnostic 
statistic. 

In both examples above, the local diagnostic statistics are calculated as 
linear transformations of data points around the position of interest. We 
may consider a more general form of local diagnostic functions, which is the 
weighted average of l^'s near the point of interest x, 



To present the weight vector in a kernel, K{u), the weight can be cho- 
sen as Wi{x) = sgn(x + ^ — i) ■ Kh{i — x), where Kh{u) = h~^K{u/h). The 
equal-weight case (2.1) is a special case corresponding to the uniform kernel 
K{u) = \ ■ l{\u\<i}- 



n 



D{x,h) = y2wi{x)Yi, 
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For the second example, at the point of interest x, the local linear esti- 
mator of the first derivative is 

n 

D{x,h) = '^WiYi, 

i=l 

where Wi = g„ oSn\-5^ ^ Kh{i- x) [Sn.o(^ -x)- Sn,i], and Sn,i = Y.7=i^h{i- 

x){i — x)\l = 0, 1,2. Note that this is simply the local linear estimator for 
the first derivative with kernel K{u) and bandwidth h. See Fan and Gijbels 
(1996) or Gijbels, Hall and Kneip (1999). In the supplementary material 
[Niu and Zhang (2012)], a class of more general local diagnostic statistics is 
introduced. 

In all numerical studies within this paper, we use only the equal- weight 
diagnostic function (2.1) for properly chosen /i, because of the piecewise 
constant structure of the mean /x. In Section 2.5 we discuss the choices of 
weight and bandwidth in detail. From now on, we denote the diagnostic 
function by D{x, h) when h is specified. When the context is clear, we may 
simply use D{x). 

2.3. The screening and ranking algorithm. Given an integer h, the local 
diagnostic function D(x, h) can be calculated easily. Obviously, a large value 
of \D{x,h)\ implies a high probability of x being or neighboring a change- 
point. Therefore, it is reasonable to find and rank the local maximizers of 
\D{x,h)\. To be precise, we first define the /i-local maximizer of a function 
as follows. 

Definition. For any number x, the interval [x — h,x + h) is called the 
/i-neighborhood of x. And, x is an /i-local maximizer of function /(•) if / 
reaches the maximum at x within the /i-neighborhood of x. In other words, 

f{x) > f{x') for all x' G (x — /i, X -|- /i). 

Since we always consider the /i-local maximizer of \D[x,h)\, we may 
omit h when the context is clear. Let CM be the set of all local maximizers 
of the function \D{x, h)\. We select a subset = {fi < f2 < • • • < f j} C CM 
by a thresholding rule 

(2.2) \D{f,h)\>\. 

The index set, J , and J are the SaRa estimators for the locations and 
the number of change-points, respectively. A choice of the threshold A by 
asymptotic analysis is given in Section 3. An alternative way to determine J 
is by a Bayesian-type information criterion [Yao (1988), Zhang and Sieg- 
mund (2007)]. By ranking local maximizers of ]D(x)], we get a sequence xi, 
X2, . . . with ]D(xi)] > ]L'(x2)l > ■ ■ ■• Plugging the first J Xj's into 

BIC(J) = ^log<T5 + Jlogn, 
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where a'j is the MLE of the variance assuming xi, . . . ,xj are change-points. 

Then J = argminBIC( J) is the estimate for the number of change-points and 
(fi , . . . , f j) = (rE(;^) , . . . , x^j-^ ) is the estimate for the location vector, where 
are the order statistics of xi,...,Xj. The procedure for the modified BIC, 
proposed in Zhang and Siegmund (2007), is the same as the BIC except 
using 

n 3 ~ 1 ■^^^ 

mBIC(J) = - log(Tj + - Jlogn + -^log(3;(i)/n-X(j_i)/n), 

i=l 

where x^q-j = and = n. 

2.4. Computational complexity. One of the advantages of the SaRa is its 
computational efficiency. To calculate the SaRa estimator, there are three 
steps: calculate local diagnostic function D, find its local maximizers, and 
rank the local maximizers. For the equal-weight case, it takes 0(n) opera- 
tions to calculate D, thanks to the recursive formula 

Dh{x + 1) = Dh{x) + (2y,+i - Y,_h+^ - Y,+h+i)/h. 

Moreover, 2n comparisons are sufficient to find local maximizers of D. Note 
that there are at most n/h of /i-local maximizers by definition, which implies 
that only O(^log^) operations are needed to rank these local maximizers. 
As h is suggested to be O(logn) by asymptotic analysis, log ^) = 0{n). 
In fact, by the thresholding rule, we may not have to rank all the local 
maximizers, which can save more time. Altogether, it takes 0(n) operations 
to calculate the SaRa estimator. 

For the general case, the computational complexity is 0{nh) = O(nlogn). 
Therefore, the SaRa is more efficient than the 0{'n?) algorithms. 

2.5. The multi-bandwidth SaRa. In the SaRa procedure, involved are 
several parameters such as bandwidth h, weight w and threshold A. Although 
we give an asymptotic order of these quantities in Section 3, it is desirable 
to develop a data adaptive technique to determine their values. 

Similar to other kernel-based methodologies, the choice of the kernel func- 
tion, or the weight, is not as crucial as the bandwidth. In this paper, since we 
consider model (1.1) with piecewise constant mean fi, the equal-weight (2.1) 
performs well. In some potential applications, when the mean fi is piecewise 
smooth, the local linear estimator of the first derivative is more appropriate. 

Choosing the optimal bandwidth is usually tricky for algorithms involving 
bandwidths. To better understand the bandwidth selection problem, let us 
look at the local diagnostic statistic D{x) more carefully. In fact, D(x) is 
just a statistic used to test the hypothesis 

i^o '■ there is no change at x vs Hi : there is a change at x. 
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Consider the simplest case when there is no other change-point nearby. We 
have 



where 5 is the difference between the means on the left and right of x. 
Obviously, the larger the h is, the more powerful the test is. Therefore, when 
applying the SaRa to detect change-points, we prefer to use long bandwidth 
for a change-point at which the mean function jumps only slightly. However, 
when using a large bandwidth, there might be other change-points in the 
interval (x — h,x + h). Then (2.3) does not hold anymore. To lower this 
risk, we prefer small bandwidth, especially at a point where the jump size is 
large. In summary, the bandwidth is of less concern when the change-points 
are far away from each other and the mean shifts greatly at each change- 
point. Otherwise, we need to choose the bandwidth carefully. Moreover, the 
optimal bandwidths for various positions may be different. It is not an easy 
task to determine the optimal bandwidth for each position when we do not 
have any prior knowledge of change-points. 

The choice of the threshold A is easy if we know in advance that the 
jump sizes are more than a constant c for all change-points or we target 
on only the change-points where the mean shifts tremendously. Otherwise, 
the choice can be tricky. If we do not use a uniform bandwidth for all 
points, we may have to use different thresholding rules for different posi- 
tions. 

From the discussion above, we see that it is necessary to develop a data- 
adaptive method for parameter selection to enhance the power of the SaRa. 
Here we propose the multi-bandwidth SaRa, which can choose the band- 
width and threshold implicitly and data-adaptively for each change-point. 
The procedure of multi-bandwidth SaRa is as follows. In Step 1, we select 
several bandwidths, say, hi, ... , hx, and run the SaRa for each of these band- 
widths. We can use a conservative threshold, say, = Cy^2/hk^ with C = 2 
or 3. Note that \D(x,hk)\, k = 1, . . . , K , typically have different local max- 
imizers, especially when the signal-to- noise ratio is small. We collect these 
SaRa estimators which constitute a candidate pool with a moderate size. In 
Step 2, we may apply the best subset selection with a BIC-type criterion 
[Yao (1988), Zhang and Siegmund (2007)] to the candidate pool. For exam- 
ple, (1.4) can be used to estimate the number and locations of change-points. 
That is, the candidate pool has a much smaller size than n and can cover 
the true change-points in the sense of Theorem 1 in Section 3. In practice, 
we recommend a forward stepwise selection or backward stepwise deletion 
instead of the best subset selection when the candidate pool is still big. For 
instance, to employ the backward stepwise deletion method, we delete the 
elements one by one from the candidate pool. At each step, we delete the 
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one which leads to the least increase for in (1.4). The procedure stops 
if ^ log (7 J + J log n stops decreasing, where J is the size of the remaining 
candidate pool. 

We may consider that the multi-bandwidth SaRa is a combination of lo- 
cal and global methods. In the first step, the fast local method is employed 
to reduce the dimensionality. In the second step, the global method is ap- 
plied for accurate model selection. Moreover, the second step also serves 
as a parameter selection tool for the SaRa. For example, imagine the case 
that J = 100 is a change-point, and the SaRa with bandwidths h = 10, 15 
and 20 suggests 97, 99 and 105 as the estimates of the change-point, respec- 
tively. The second step of the multi-bandwidth SaRa, that is, the subset 
selection procedure with the BIC-type criterion, might suggest 99 as the 
change-point. In this case, the bandwidth is selected implicitly as 15. As 
a result, the multi-bandwidth SaRa behaves like the SaRa with the opti- 
mal bandwidth at each change-point, which is also verified by the numerical 
studies. Note that we can use the strategy of the multi-bandwidth SaRa for 
a single bandwidth h, which usually improves the performance. The reason 
is that the SaRa arranges the location estimators in the order of magnitudes 
of the local diagnostic functions, while the multi-bandwidth SaRa can re- 
arrange the order using the best subset selection by making use of global 
information, and, hence, the multi-bandwidth SaRa is more reasonable in 
most cases. In this procedure, the threshold parameter is selected implicitly 
for each position by the subset selection procedure. 

The computation cost of Step 2 in the multi-bandwidth SaRa depends on 
the size of the candidate pool, which is highly related to the true number 
of change-points J. With J <^n, the size of the candidate pool can be well 
controlled by setting appropriate threshold Afc. Thanks to the sequential 
structure of observations, the backward stepwise selection can be employed 
for subset selection efficiently. In applications when n is large. Step 2 will 
not increase the computation time significantly. 

In practice, we still have to choose bandwidths hi,. . . , hx for the multi- 
bandwidth SaRa, although their values are not as crucial as the one in 
the original SaRa. The choice of these bandwidths should depend on the 
applications, in which certain information may be available. If not, a de- 
fault setting of K = 3, hi = logn, /i2 = 21ogn and h^ = 31ogn is recom- 
mended. 

3. Sure coverage property. The main purpose of this section is to show 
that the SaRa satisfies the sure coverage property. In other words, the union 
of intervals J ^h selected by the SaRa includes all change-points with prob- 
ability tending to one as n goes to infinity. The nonasymptotic probability 
bound is shown as well. It will also be clear that the sure coverage property 
implies model consistency of the SaRa. 
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Let J = J{n) be the number of change-points in model (1.1). We denote 
the set of all change-points hy J' = {ri < r2 < • • • < tj}. We assume tq = 
and Tj+i = n for notational convenience. For simplicity, we assume that 
the noises are i.i.d. Gaussian with variance o"^. We will use the equally 
weighted diagnostic function (2.1) with bandwidth h in this section, but all 
theorems can be easily extended for all weights in the family W, defined 
in the supplementary material [Niu and Zhang (2012)]. We write = Jh,\^ 
defined by the thresholding rule (2.2), which depends on the choices of the 
bandwidth h and the threshold A. 

Define b = mini<j<j \6j\, L = mini<j<j+i(Tj — Tj-i). Intuitively, when 5 
and L are too small, no methods can detect all change-points. A key quantity 
for detecting change-points is S'^ = 6^L/a'^. We assume 

(3.1) = 5^L/a^ >32\ogn. 

Theorem 1. Under Assumption (3.1), there exist h = h{n) and A = 
A(n) such that J = Jh^\ = {fi, . . . , f j} satisfies 

(3.2) lim P({ J = J}r\{JCl:J± h}) = 1, 

n— >oo 

where by J C: j ± h we mean Tj G {fj — h, fj + h) for a// j G {1, . . . , J} . In 
particular, taking h = L/2 and A = 5/2, we have 

(3.3) ¥{{J = J] r\{J d-.J ±h]) > 1-85"^ exp{logn- 5^32}. 

Remark 1. We believe Theorem 1 can be generalized to the non-Gaussian 
case, although we may need a stronger condition than (3.1) in order to bound 
the tail probability by Bernstein's inequality. However, the theoretical re- 
sults are likely to be more complicated. 

Remark 2. We take h = L/2, \ = 5/2 and constant 32 in Assump- 
tion (3.1) for clearer demonstration of our proof. Obviously the optimal 
constant in Assumption (3.1) to make (3.2) hold depends on the choice 
bandwidth h, threshold A as well as the usually unknown true data gener- 
ating process. 

Remark 3. If L = o(n) [e.g., in Assumption (3.1), 6 and a are con- 
stants, L is of order logn], by Theorem 1, with probability tending to 1, we 
have J = J and — rj| < h. In particular, |fj/n — Tj/n|— t-O, which implies 
consistency [in the sense of Yin (1988), Yao and Au (1989)] of the SaRa. In 
fact, the conclusion of Theorem 1 is slightly stronger than model consistency 
since it shows |fj/n — Ti/n\ — )• uniformly with an explicit rate. Moreover, in 
the conventional asymptotic analysis, it is usually assumed that J is fixed 
and Ti/n ^ ti is constant for each i, as n goes to infinity, which implies 
L/n — )■ constant. Our asymptotic setting (3.1) is more challenging but it 
incorporates L, 6 and n better. 
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Fig. 1. Power comparison of the three tests under different Jump size to Standard de- 
viation Ratios (JSR), change-point locations (Lo) and type I errors (a). Black solid, red 
dashed and blue dotted lines correspond to the likelihood ratio test, and SaRa with band- 
widths 10 and 15, respectively. The results are based on 10,000 replications. 

4. Numerical results. 



"■^ in hypothesis testing prob- 
of the diagnostic function 



4.1. Simulation I: Detecting for a single change-point. Although the SaRa, 
as a change-point detecting tool, is designed for large data sets with mul- 
tiple change-points, we may also compare its power with classical methods 
in the conventional setting. In this subsection, we consider the hypothesis 
testing problem (1.2) (with a known variance) and compare the SaRa with 
the likelihood ratio test (1.3). 

We assume Yi ~ N{fii,a'^) with known cr' 
lem (1.2). Let n = 100. The maximum 
maxi<2^^<99 |D(rE, /i)| can serve as a test statistic and can be viewed as the 
simplest version of SaRa. We consider only the equal weight case (2.1) for 
simplicity. We compare three test statistics, the log likelihood ratio test 
statistic (1.3) and the SaRa with bandwidths h = 10 and h = 15. We run 
10,000 replications to get distributions of the three test statistics under the 
null hypothesis. We control the type I error to the levels a = 0.05 and 0.01, 
respectively, and examine the powers of the three tests when the change- 
point locations are 10, 30, 50 and uniformly distributed, respectively. 

In Figure 1, we see that the power increases as the Jump size to Standard 
deviation Ratio (JSR) increases. The likelihood ratio test is more powerful 
when the change-point location is near the middle and less powerful oth- 
erwise. The tests using the SaRa are robust with respect to the location 
since the diagnostic function is locally defined. It is not surprising that the 
likelihood ratio test performs better overall since the SaRa-based tests use 
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Table 1 

The estimated model sizes J and Sure Coverage Probabilities (SCP) of SaRa by the 
thresholding rule under different settings {n,L) and noise levels a. Column 3 lists the 
distribution and mean value of the estimated number of change-points. Columns 4 o-rid 5 
list SCPs of two change-points as well as mean distance between estimated change-point 
locations and true locations. The results are based on 1000 replications 







Number of change-points 


Chcinge-point 


Change-point 


(n,L) 


cr 


J = 2 


<2 


>2 


Mean 


1 SCP (Mean) 


2 SCP (Mean) 


(400, 12) 


0.5 


63.5% 


11.7% 


24.8% 


2.175 


91.3% (0.756) 


91.3% (0.716) 




0.25 


98.2% 


1.8% 


0.0% 


1.980 


98.9% (0.129) 


99.1% (0.119) 


(3000, 16) 


0.5 


60.3% 


8.3% 


31.4% 


2.306 


92.8% (0.814) 


93.4% (0.776) 




0.25 


98.1% 


1.9% 


0.0% 


1.980 


99.3% (0.118) 


98.7% (0.129) 


(20,000, 20) 


0.5 


60.2% 


6.3% 


33.5% 


2.343 


94.3% (0.862) 


94.8% (0.841) 




0.25 


99.3% 


0.7% 


0.0% 


1.993 


99.5% (0.139) 


99.8% (0.108) 


(160,000, 24) 


0.5 


49.5% 


5.0% 


45.5% 


2.599 


95.8% (0.877) 


95.0% (1.013) 




0.25 


99.5% 


0.5% 


0.0% 


1.995 


99.8% (0.096) 


99.7% (0.148) 



information only within a small neighborhood. However, we see that the 
SaRa-based tests perform well when the JSR is greater than 1.5, and the 
SaRa with h = 15 outplays the likelihood ratio test when the locations are 
far from the middle or random. 

4.2. Simulation II: Sure coverage property. In this subsection we test the 
SaRa on a simple but challenging example. Consider the situation when there 
is a small CNV with length L buried in a very large segment with length n, 
where L = O(logn). Explicitly, we assume the true mean vector ijl = E(Y) 
satisfying //j = + (5 • I{n/2<i<n/2+L}- other words, among all n positions, 
there are two change-points located on the position n/2 and n/2 + L with 
jump sized 5 and —5. The locations and the number of aberrations do not 
influence our method. We set {n,L) = (400,12), (3000,16), (20,000,20) and 
(160,000, 24), which satisfy approximately L ~ 21ogn. We fix the jump size 
5 = 1 and assume that the noises are i.i.d. A/'(0,cj^) with a = 0.5 and 0.25, 
so S"^ ~ 81ogn and 32 log n respectively. We run the SaRa with the thresh- 
olding rule (2.2), taking h = |L and A = |(^. We list the simulation results 
in Table 1, based on the 1000 replications. We see that the SaRa can esti- 
mate the number of change-points as well as their locations accurately in 
the less noisy case. The two change-points can be detected with very high 
probabilities and the estimated errors are very small with the average below 
one and median at zero. Even in the noisy case, the two points can be de- 
tected with very high probabilities, and the number of the false discovered 
change-points is very small, at most 1 in most cases. The simulation results 
match our theory perfectly. 
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Fig. 2. Above: example of one Monte Carlo sample of simulation model (4-1) with 
a = 0.2, a = 0; bottom: the local diagnostic functions with bandwidth h = 9, with noise 
(blue dashed line) and without noise (red solid line). 

4.3. Simulation III: An example in Olshen et al. (2004)- This simulation 
example, adapted from [Olshen et al. (2004)], is more complex and realistic. 
The true data generating process is as follows: 

(4.1) Yi = fii + 0.25a sm{aTTi) + £i, i = l,...,n, 

where the error term e follows Gaussian distribution A^(0,cj^). The total 
number of markers n equals 497. There are six change-points along fi with 
position T = (137,224,241,298,307,331), = -0.18, and S = (0.26,0.99, 
— 1.6,0.69, —0.85,0.53). A sinusoidal trend was added to mimic the periodic 
trend found in the a-CGH data. The noise parameter a was set to be one of 
0.1 or 0.2, and the trend parameter a £ {0,0.01,0.025} corresponds to none, 
long and short trends, respectively. A simulated data set with a = 0.2 and 
a = is illustrated in Figure 2. Among the six change-points, the ones at 
137, 298 and 307 are more difficult to detect, since the jump size at 137 is 
small, and the length of CNV between 298 and 307 is quite short. 

We applied a few methods to the 1000 simulated data sets of 497 points. 
The first method is the fast CBS algorithm by package DNAcopy [Venkatra- 
man and Olshen (2007)]. Since CBS is quite sensitive and may lead to many 
false positives, we also tried the two-stage procedure CBS-SS, which em- 
ploys subset selection (SS) with modified BIC [Zhang and Siegmund (2007)] 
to delete false positives. The second method is the SaRa. We tried a few dif- 
ferent bandwidths {h = 9, 15 and 21) to test its performance. The model 
size was determined by the modified BIC, described at the end of Sec- 
tion 2.3. We also applied the multi-bandwidth SaRa (m-SaRa). For the 
m-Sara, we first used three bandwidths h = 9, 15 and 21 and chose the 
threshold A = 2y^2/7i • (t for each h to select candidates. Then we applied 
the backward stepwise deletion to get our final estimate from the set of 
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Table 2 

Number of change-points detected by the CBS algorithm and SaRa 



Number of change-points 



a 


Trend 


Method 


<5 


6 


7 


8 


>8 


0.2 


None 


fast CBS 





872 


88 


39 


1 


0.2 


None 


CBS-SS 


2 


998 











0.2 


None 


SaRa {h = 9) 


166 


639 


150 


36 


9 


0.2 


None 


SaRa (ft = 15) 


42 


901 


51 


6 





0.2 


None 


SaRa {h = 21) 


157 


833 


10 








0.2 


None 


m-SaRa 





998 


2 








0.2 


Short 


fast CBS 





678 


194 


101 


27 


0.2 


Short 


CBS-SS 


9 


991 











0.2 


Short 


SaRa {h = 9) 


220 


584 


156 


37 


3 


0.2 


Short 


SaRa (ft = 15) 


100 


780 


107 


10 


3 


0.2 


Short 


SaRa (ft = 21) 


350 


586 


60 


4 





0.2 


Short 


m-SaRa 





992 


8 








0.2 


Long 


fast CBS 


1 


695 


148 


135 


21 


0.2 


Long 


CBS-SS 


6 


991 


3 








0.2 


Long 


SaRa (ft = 9) 


263 


597 


121 


15 


4 


0.2 


Long 


SaRa (ft = 15) 


101 


840 


53 


6 





0.2 


Long 


SaRa (ft = 21) 


317 


669 


14 








0.2 


Long 


m-SaRa 





960 


40 









candidates. (Applying the best subset selection would be better but a lit- 
tle slower. There are about 20 candidates.) The results for the noisy case 
(cj = 0.2) are illustrated in Table 2, which lists the number of change-points 
detected by these methods under different trends. We did not assume the 
noise variance is known, so all the variances in different scenarios were es- 
timated. To estimate the variance, we first estimated the mean //j for each 
point using a local constant regression. Then the variance can be estimated 
by the residual sum of squares divided by n. We omitted the less noisy case 
(7 = 0.1 since all methods performed well except the fast CBS, which pro- 
duced a few false positives. For a = 0.2, the SaRa may underestimate the 
number of change-points when the bandwidth is too large or too small. The 
SaRa with bandwidth /i = 15 is better than CBS. /i = 9 or /i = 21 is not op- 
timal since it is hard to detect the first change-point with small bandwidth 
and it is difficult to detect the CNV between 298 and 307 with too large 
bandwidth. However, large bandwidth helps a lot for the first change-point 
and small bandwidth gives a more accurate estimate for the fourth and fifth 
change-points. It is not surprising that the multi-bandwidth SaRa, which 
combines the power of all bandwidths, performs best in this complex and 
noisy example. We also observed that the two-stage procedure CBS-SS can 
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Table 3 

Detection rates for each change-point and Average Falsely Discovered (AFD) Count 



Trend 


Method 


CPl 


CP2 


CPS 


CP4 


CPS 


CP6 


AFD 


None 


CBS-SS 


92.8% 


100% 


100% 


99.8% 


99.8% 


99.6% 


0.076 


None 


m-SaRa 


90.6% 


100% 


100% 


99.9% 


100% 


100% 


0.097 


Short 


CBS-SS 


80.9% 


100% 


100% 


99.1% 


99.1% 


100% 


0.191 


Short 


m-SaRa 


83.0% 


100% 


100% 


99.9% 


100% 


100% 


0.179 


Long 


CBS-SS 


81.2% 


100% 


100% 


99.7% 


99.8% 


97.1% 


0.217 


Long 


m-SaRa 


87.1% 


100% 


100% 


99.9% 


100% 


99.8% 


0.172 



improve the CBS by deleting false positives. The performances of CBS-SS 
and m-SaRa are comparable. 

In Table 3 we report the detection rate and average falsely discovered 
count for the CBS-SS and m-SaRa. We use an aggressive criterion which 
considers the change-point undetected if the estimation error is above 5. 
A location estimator is falsely discovered if there is no true change-point 
within distance of 5. The detection rates of two methods are similar. In the 
cases with trends, the multi-bandwidth SaRa is slightly better. 

4.4. Application to Coriel data. To test the performance of our algo- 
rithm, we apply it to the Coriel data set which was originally studied by 
Snijders et al. (2001). This is a typical aCGH data set, which consists of 
the log-ratios of normalized intensities from the disease vs control samples, 
indexed by the physical location of the probes on the genome. The goal is 
to identify segments of concentrated high or low log-ratios. This well-known 
data set has been widely used to evaluate CNV detection algorithms; see 
Olshen et al. (2004), Fridlyand et al. (2004), Huang et al. (2005), Yin and 
Li (2010), among others. The Coriel data set consists of 15 fibroblast cell 
lines. Each array contains measurements for 2275 BACs spotted in tripli- 
cates. There are 8 whole chromosomal alterations and 15 chromosomes with 
partial alterations. All of these aberrations but one (Chromosome 15 on 
GM07801) were confirmed by spectral karyotyping. 

The outcome variable used for the SaRa algorithm was the normalized 
average of the log2 ratio of test over reference. Note that in this data set 
there are only three possible mean values corresponding to loss (monosomy) , 
neutral and gain (trisomy). However, we did not assume we know this fact 
in advance. We applied the multi-bandwidth SaRa with /i = 9, 15 and 21 di- 
rectly to the data set, and used the backward stepwise deletion with modified 
BIC to select change-points from all candidates suggested by the SaRa. As 
a result, the multi-bandwidth SaRa identified all but two alterations (Chro- 
mosome 12 on GM01535 and Chromosome 15 on GM07081). For Chromo- 
some 12 on GM01535, the region of alteration is represented by only one 
point and was hence difficult to detect. For Chromosome 15 on GM07081, 
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Fig. 3. Estimated mean functions by the multi-bandwidth SaRa on four cell lines. 



our result agrees with Snijders et al. (2001) that there is no evidence of an 
alteration from the data set. Our method also found a few alterations that 
were not detected by spectral karyotyping. The multi-bandwidth SaRa sug- 
gests that there might be CNVs on chromosomes 7, 11, 21 for GM00143, 
chromosome 8 for GM03134, chromosomes 7, 21 for GM02948, and chromo- 
some 11 for GM10315 (see Figure 3). And most of these additional CNVs 
are short aberrations. They may be false positives or true ones that cannot 
be confirmed by spectral karyotyping due to its low resolution. We also tried 
fast CBS to analyze this data set and found that CBS selected more than 
100 change-points. The result has been listed in Yin and Li (2010), so we 
omit it here. 

4.5. Application to SNP genotyping data. Due to the low resolution, 
aCGH techniques are effective to detect only long CNVs of tens or hun- 
dreds of kilobases. By recent studies [Conrad et al. (2006), McCarroU et al. 
(2006)], it has been shown that short CNVs are common in the human 
genome. The genome-wide SNP genotyping arrays, which are able to assay 
half a million SNPs, improve the resolution greatly and offer a more sensitive 
approach to CNV detection. 

We illustrate the SaRa using SNP genotying data for a father-mother- 
offspring trio produced by the Illumina 550K platform. The data set can 
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Table 4 

Number of change-points detected for the offspring by different methods 





SaRa 


m-SaRa 


CBS 


FL 


PennCNV 


Chromosome 3 


2 


19 


46 


511 


2 


Chromosome 11 


4 


2 


29 


345 


4 


Chromosome 20 


4 


7 


16 


143 


2 



be downloaded along with the PennCNV package. The data consist of the 
measurements of a normaHzed total signal intensity ratio called the Log R 
ratio, that is, calculated by Iog2(-Robs/-Rcxp)5 where Rohs is the observed total 
intensity of the two alleles for a given SNP, and i?exp is the expected intensity 
computed from linear interpolation of the observed allelic ratio with respect 
to the canonical genotype clusters [Peiffer et al. (2006)]. For each subject, 
the Log R ratios along Chromosomes 3, 11 and 20 are included in the data 
set. There are 37,768, 27,272 and 14,296 SNPs in Chromosomes 3, 11 and 
20, respectively. Similar to the aCGH data, the segments with concentrated 
high or low Log R ratios correspond to gains or losses of copy numbers. 

We employed the SaRa, multi-bandwidth SaRa, CBS, the fused lasso (FL) 
and PennCNV to detect CNVs in this data set. We used R packages DNAcopy 
(version 1.14.0), cghFLasso (version 0.2-1) and free software PennCNV for the 
last three algorithms. The SaRa and m-SaRa were implemented by the R 
program. For the SaRa, we took /i = 10 and A = 2-v/logn • \/2/h ■ a. For the 
m-SaRa, we used three bandwidths h = 10, 20 and 30 and the threshold A = 
3y^2//i • a in the first step. The backward stepwise selection with modified 
BIC was employed in the second step. 

The result for the offspring is listed in Table 4. The fused lasso and CBS 
detected too many change-points, most of which are most likely to be false 
positives. The performance of the SaRa and PennCNV are similar. In par- 
ticular, the SaRa can identify all 4 CNVs found by PennCNV as well as 
an additional one on Chromosome 20 from position 5,851,323 to 5,863,922 
kilobase. It seems that the m-SaRa did not perform as well as the SaRa in 
this example. The reason is two-fold. First of all, the signal to noise ratio 
(i.e., jump size to standard deviation ratio) is large for this data set. There- 
fore, the simple SaRa works very well. Second, in the second step of the 
m-SaRa, the modified BIC was employed for model selection. The modified 
BIC assumes a uniform prior over the parameter space [Zhang and Sieg- 
mund (2007)], which may not be satisfied since the CNVs are very short in 
this example. 

All computations were done on a 3.33 GHz Intel(R) Core(TM)2 Duo 
PC running the Windows XP operating system. It took about 16 s for the 
PennCNV to show the result for all 3 subjects. So the computation time for 
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Table 5 

Computation time (in seconds) for different methods 





Number of SNPs 


SaRa 


m-SaRa 


CBS 


FL 


Chromosome 3 


37,768 


1.63 


8.93 


63.28 


3.10 


Chromosome 11 


27,272 


1.16 


2.64 


34.05 


1.79 


Chromosome 20 


14,296 


0.61 


1.67 


17.37 


0.66 



each subject is about 5 s. All other algorithms were operated in R software. 
The computation time for each algorithm is listed in Table 5. All algorithms 
are fast and practical. In particular, the SaRa is one of the fastest. Moreover, 
we observed that the computation time for the SaRa increases linearly with 
the sample size. 

Overall, the SaRa and PennCNV are the best among all algorithms. The 
fused lasso and CBS mainly target on the aCGH data and may not be 
suitable for the SNP genotyping data. The m-SaRa equipped with a proper 
Bayesian-type information criterion is potentially useful. In the SaRa, we 
used /i = 10 since it is known that the CNVs are quite short. The thresholds 
were chosen by asymptotic analysis of the extremal values of D{x). In fact, it 
was quite obvious when the local maximizers were ranked. For the offspring, 
there were only 10 local maximizers at which the values of were larger 

than 0.57. The values of |-D(x)| at other local maximizers were less than 0.26. 

It is surprising that the SaRa and PennCNV can give similar results since 
they are based on different models. The advantage of PennCNV is that it 
can utilize more information, that is, the finite states of copy numbers and 
the B Allele Frequency (BAF) besides the Log R ratio. The value of the 
SaRa is its simplicity and generality. The SaRa can be implemented easily 
and is potentially useful in other multiple change-points problems. 

5. Discussion. Motivated by copy number variation detection, many mul- 
tiple change-point detection tools have been invented and developed recently. 
However, faster and more efficient tools are needed to deal with the high 
dimensionality of modern data sets. Different from other approaches, we 
propose a screening and ranking algorithm, which focuses on the local infor- 
mation. It is an extremely efficient method with computational complexity 
down to 0{n) and suitable for huge data sets. Moreover, it is very accu- 
rate and satisfies the sure coverage property. Note that, as far as we know, 
very few theoretical results have been developed for multiple change-points 
detection tools. Besides the efficiency and accuracy, the SaRa is easily im- 
plementable and extendable. For example, since only local information is 
involved in the computation, it is easy to extend it to an on-line version, 
which may be useful for financial data. In addition, we may use the SaRa 
for the heteroscedastic Gaussian model when the variance is not constant. 
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We can estimate the variance using local information and take it into ac- 
count when calculating local statistics D(x,h). We should note, however, 
that both the implementation and theory for the heteroscedastic variances 
will be more complicated and will require further extensive effort. In the 
SaRa procedure, the choices of bandwidth h and threshold A are important. 
In applications, optimal choices of these parameters may depend on the po- 
sitions and jump-sizes of change-points. In this case, the multi-bandwidth 
SaRa, which selects parameter implicitly and data-adaptively, is practically 
useful. Knowledge about the data can be particularly useful in determining 
reasonable or meaningful gaps between change-points and and jump-sizes at 
change points, which can prevent us from using poor choices of bandwidth h 
and threshold A. 

The SaRa is a useful method for solving change-points problems for a high 
dimensional sparse model such as (1.1) and can be generalized to solve more 
general change-point problems. However, we should point out that the SaRa 
can be improved for CNV detection since model (1.1) may not capture all 
the characteristics of the CNV problem. For example, CNVs can be short. 
Then, it is better to modify the SaRa accordingly to improve its power. 

SUPPLEMENTARY MATERIAL 

A description of general weight functions and technical proofs. (DOI: 
10.1214/12-AOAS539SUPP; .pdf). The pdf file contains a description of 
general weight functions and the proof of Theorem 1 . 
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